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Abstract 

Many Random Number Generators (RNG) are available nowadays; they are divided in two cat- 
egories, hardware RNG, that provide "true" random numbers, and algorithmic RNG, that generate 
pseudo random numbers (PRNG). Both types usually generate random numbers (X n ) n as indepen- 
dent uniform samples in a range 0, ... 2 b — 1, with 6 = 8, 16, 32 or 6 = 64. In applications, it is 
instead sometimes desirable to draw random numbers as independent uniform samples (Y n ) n in a 
range 1, . . . M, where moreover M may change between drawings. Transforming the sequence (X n ) n 
to (Y n ) n is sometimes known as scaling. We discuss different methods for scaling the RNG, both in 
term of mathematical efficiency and of computational speed. 
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1 Introduction 



We consider the following problem. We want to generate a sequence of random numbers (Y n ) n with 
a specified probability distribution, using as input a sequence of random numbers (X n ) n uniformly 
distributed in a given range. There are various methods available; these methods involve transforming 
the input in some way; for this reason, these methods work equally well in transforming both pseudo- 
random and true random numbers. One such method, called the acceptance-rejection method, involves 
designing a specific algorithm, that pulls random numbers, transforms them using a specific function, 
tests whether the result satisfies a condition: if it is, the value is accepted; otherwise, the value is rejected 
and the algorithm tries again. 

This kind of method has a defect, though: if not carefully implemented, it throws away many inputs. 
Let's see a concrete example. We suppose that we are given a RNG that produces random bits, evenly 
distributed, and independent We want to produce a random number R in the range {1, 2, 3}, uniformly 
distributed and independent. Consider the following method. 

Example 1 We draw two random bits; if the sequence is 11, we throw it away and draw two bits again; 
otherwise we return the sequence as R, mapping 00,01, 10 to 1,2,3. 

This is rather wasteful! The entropy in the returned random R is log 2 (3) = 1.585bits; there is a 1/4 
probability that we throw away the input, so the expected number of (pair of tosses) is 4/3 and then 
expected number of input bits is 8/3; all together we are effectively using only 

= 59% 

8/3 

of the input. 

The waste may be consider as an unnecessary slowdown of the RNG: if the RNG can generate 1 bit in 
1/Lts, then, after the example procedure, the rate has decreased to 1.6/xs per bit. Since a lot of effort was 
put in designing fast RNG in the near past, then slowing down the rate by +60% is simply unacceptable. 

Another problem in the example method above is that, although it is quite unlikely, we can have a 
very long run of "00" bits. This means that we cannot guarantee that the above procedure will generate 
the next number in a predetermined amount of time. 

There are of course better solutions, as this ad hoc method. 

Example 2 ([lj) We draw eight random bits, and consider them as a number x in the range 0. . .255; 
if the number is more than 3° — 1 = 242, we throw it away and draw eight bits again; otherwise we write 
x as five digits in base 3 and return these digits as 5 random samples. 

This is much more efficient! The entropy in the returned five samples is 51og 2 (3) = 7.92bits; there is 
a 13/256 probability that we throw away the input, so the expected number of 8-tuples of inputs is 
256/253 and then expected number of input bits is 2048/253; all together we are now using 

51 °g2(3) = 97% 
2048/253 /0 

of the input. 

In this paper we will provide a mathematical proof (section 2), and discuss some method (section 3), 
to optimize the scaling of a RNG. Unfortunately after writing this paper it turned out that one of the 
ideas we are presenting in section [3] was already described in pQ . This paper contains the mathematical 
proof of the method, the discussion of how best to choose parameters, discussion of its efficiency, and 
numerical speed tests. 

Remark 3 A different approach may be to use a decompressing algorithm. Indeed, e.g., the arithmetic 
encoder decoding algorithm, can be rewritten to decode a stream of bits to an output of symbols with 
prescribed probability distributions [^} Unfortunately, it is quite difficult to mathematically prove that 
such an approach really does transform a stream of independent equidistributed bits into an output of 
independent random variables. Also, the arithmetic encoder is complex, and this complexity would slow 
down the RNG, defeating one of the goals. (Moreover, the arithmetic encoder was originally heavily 
patented.) 



1 For example, repeatedly tossing a coin with the faces labeled 0, 1. 
2 If interested, I have the code somewhere in the closet 
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A note on notations. In all of the paper, ns is a nanosecond, that is 10 9 seconds. When a; is a real 
number, [x\ = f loor(x) is the largest integer that is less or equal than x. 



2 Process splitting 

Let IN = {0, 1, 2, 3, 4, 5 . . .} be the set of natural numbers. 

Let (fl,A,F) a probability space, let (E,£) be a measurable space, and X a process of i.i.d. random 
variables (X„)„ £ n defined on (Cl,A,T > ) and each taking values in (E,£). We fix an event S E £ such 
that ¥{X t eS}^0,l;we define p s = V{X, E S}. 

We define a formal method of process splitting/unsplitting. 

The splitting of X is the operation that generates three processes B,Y,Z, where B = (B n ) ne ^ is 
an i.i.d. Bernoulli process with parameter p§, and Y = (Y n ) ne ^ and Z = (Z n ) ne ^ are processes taking 
values respectively in S and E\S. The unsplitting is the opposite operation. These operations can be 
algorithmically and intuitively described by the following pseudocode (where processes are thought of as 



queues of random variables). 


procedure Splitting^ h> (B,Y,Z)) 


procedure Unsplitting((B, Y, Z) i-» X) 


initialize the three empty queues B,Y,Z 


initialize the empty queue X 


repeat 


repeat 


pop X from X 


pop B from B 


if X E S then 


if B = 1 then 


push 1 onto B 


pop Y from Y 


push X onto Y 


push Y onto X 


else 


else 


push onto B 


pop Z from Z 


push X onto Z 


push Z onto X 


end if 


end if 


until forever 


until forever 


end procedure 


end procedure 



The fact that the splitting operation is invcrtiblc implies that no entropy is lost when splitting. We 
will next show a very important property, namely, that the splitting operation preserve probabilistic 
independence. 



2.1 Mathematical formulation 

We now rewrite the above idea in a purely mathematical formulation. 
We define the Bernoulli process (-B n )neN by 



and the times of return to success as 



U = inf{fc : k > 0,B k = 1} (2) 
U n = mf{k:k >l + C/„_i, B k = l}, n>l (3) 



whereas the times of return to unsuccess are 

V = inf{fc : k > Q,B k = 0} (4) 
V n = inf{fc : k > 1 + [/„_!, B k = 0}, n > 1 (5) 

it is well known that (U n ), (V n ) are (almost certainly) well defined and finite. 
We eventually define the processes (5^)„ £ n and (Z n ) neK by 

Y n = X Un Z n = X Vn (6) 

Theorem 4 Assume that X is a process of i.i.d. random variables. Let \i be the law of X\. Then 
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• the random variables B n ,Y n , Z n are independent; and 

• the variables of the same type are identically distributed: the variables B n have parameter ¥{B n = 
1} = ps; the variables Y n have law \ S); the variables Z n have law | E\S). 

Proof. It is obvious that B is a Bernoulli process of independent variables with parameter P{B n = 1} = 
PS- 

Let K, M > 1 integers. Let u < U\ < . . . ux and Vq < v% < . . . vm be integers, and consider the 
event 

A = {U = u ,...U K = u Kl V = vo, . . . V M = v M } (7) 

If A ^ then 

A={B = b ,...,B N = b N } (8) 

where N — max{w^ , vm} and bj <E {0, 1} are suitably chosen. Indeed, supposing that N — uk > vm, 
then we use the success times, and set that bj = 1 iff j = Uk for a k < K; whereas supposing that 
N = vm > uk, then we use the unsuccess times, and set that bj = iff j = v m for a m < M. 

Let Tk.m be the family of all above events A defined as per equation ([7]), for different choices of 
(ui), (v-j); let 

J~ = [J F k.m ; 

K,M>1 

let A B C A be the sigma algebra generated by the process B. 

The above equality ^ proves that J 7 is a base for A B : it is stable by finite intersection, and it 
generates the sigma algebra A B . 

Consider again K, M > 1 integers, and events Fi,Gj G £ for i = 0, ... K, j = 0, . . .M, and the event 

C = {Y G F , . . . Y K G F K ,Z G G , . . . Z M G G M } G A ; 
let A G Fk_m non empty; we want to show that 

P(C | A) = P(G) 

this will prove that (Y, Z) are independent of B, by arbitriness of (Fi), (Gj),K, M and since J 7 is a base 
for A B . 

We fix (iii), (vj) and define A as in equation ([7]); we let AT = max{u^, vm} and define (b n ) as explained 
after equation By defining 

S l dot ^ ^0 daf ^ ^ s 

for notation convenience, we can write equation (|8| as 

A = {X G ^°,...X N € S 6 ™} . 

Let £q • • • -^n € £ be defined by 



Fk if n = Uk for & k < K 
G m if n = v m for a m < M 
_E else 



then we compute 

P(C | A) = P({X U0 G F , . ..X UK G F*,X„ g Go, . ..X VM G G M } | {^o € S b °, . ■ -X N G S b ~}) = 
P{X„ G Fp, . . . X UK g Fr-, vT„ g Go, . . . X„ M G Gm , X S S^ , . . . Xn G 5 hjv } _ 

P{X G S^,...X N G S* b «} " 

= ntoP{^ e ^ n h En} = f[ P(X n G G S bn ) = 

A' A/ 

= I] M (F fc I S 1 ) I] ^(Gm I S°) ; 

k=0 m=0 
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the last equality is due to the fact that: when n = Uk then b n = 1, when n = v m then b n = 0, and for 
all other n we have E n — E. Since the last term does not depend on A, that is, on {ui), (vj), we obtain 
that (Y, Z) are independent of B. 

The above equality then also shows that 

K M 

¥{Y Q eF Q ,...Y K e F K , Z e Go, . . . Z M G G M ] = JJ fi{F k \ S) [] n{G m \ S c ) 

k=0 m=0 

and this implies that Y, Z are processes of independent variables, distributed as in the thesis. By 
associativity of the independence, we conclude that the random variables B n , Y n , Z n are independent. □ 



3 Recycling in uniform random number generation 

We now restrict our attention to the generation of uniformly distributed integer valued random variables. 
We will say that R is a random variable of modulus M when R is uniformly distributed in the range 
0,...(M-1). 

We present an algorithm, that we had thought of, and then found (different implementation, almost 
identical idea) in PQ. We present the latter implementation. 

The following algorithm Uniform random by bit recycling in figure [T] given n, will return a ran- 
dom variable of modulus n; note that n can change between different calls to the algorithm. 

initialize the static integer variables in — 1 and r = 
procedure Uniform random by bit recycling (n) 
repeat 

while m < N do > fill in the state 

r : = 2*r + NextBitQ; 

m : = 2*m; > r is a random variable of modulus m 

end while 

q := \m/n\; > integer division, rounded down 

if r < n* q then 

d : = r mod n > remainder, is a random variable of modulus n 

r : = \r /n\ > quotient, is random variable of modulus q 

m : = q 
return d 
else 

r : = r - n*q > r is still a random variable of modulus m 

m : = m - n*q > the procedure loops back to line 3 

end if 
until forever 
end procedure 



Figure 1: Algorithm Uniform random by bit recycling 



It uses two internal integer variables, m and r, which are not reset at the beginning of the algorithm 
(in C, you would declare them as "static"). Initially, m = 1 and r = 0. 

The algorithm has an internal constant parameter N, which is a large integer such that 2N can still 
be represented exactly in the computer. We must have n < N . F] The algorithm draws randomness from 
a function NextBitQ that returns a random bit. 



Here is an informal discussion of the algorithm, in the words of the original author [T]. At line 10 
as r is between and (n * q — 1), we can consider r as a random variable of modulus n * q. As this is 
divisible by n, then d := (rmodn) will be uniformly distributed, and the quotient [r/n\ will be uniformly 
distributed between and q— 1. 

Note that the theoretical running time is unbounded; we will though show in the next section that 
an accurate choice of parameters practically cancels this problem. 



3 We will show in next section that it is best to have n << N . 
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4 Mathematical analysis of the efficiency 



We recall this simple idea. 

Lemma 5 Suppose R is a random variable of modulus MN ; we perform the integer division R = QN+D 
where Q £ {0, . . . (M — 1)} is the quotient and D £ {0, . . . (iV — 1)} is the remainder; then Q is a random 
variable of modulus M and D is a random variable of modulus N ; and Q, D are independent. 

Proposition 6 Let us assume that repeated calls of NextBit () return a sequence of independent equidis- 



tributed bits. Then the above algorithm Uniform random by bit recycling in figure 1 on the previous 



\page\ will return a sequence of independent and uniformly distributed numbers. 

Proof. We sketch the proof. We use the lemma above [5] and the theorem |4j Consider the notations in 
the second section. The bits returned by the call NextBit () builds up the process X. When reaching 
the if (line [9] in the pseudocode at page |5j, the choice r < n * q is the choice of the value of B n in 



equation (JT|). This (virtually) builds the process B. At line 10 r is a variable in the process Y; since it is 
of modulus nq, we return (using the lemma) the remainder as d, that is a random variable of modulus 
n, and push back the quotient into the state. At line [16] we would be defining a variable in the process 
Z, that we push back into the state. □ 

The "pushing back" of most of the entropy back into the state recycles the bits, and improves greatly 
the efficiency 

The only wasted bits are related to the fact that the algorithm is throwing away the mathematical 
stream B. Theoretically, if this stream would be feeded back into the state (for example, by employing 
Shannon- Fano-Elias coding), then efficiency would be exactly 100% j^] Practically, the numbers N and n 
can be designed so that this is totally unneeded. 

Remark 7 Indeed, consider the implementation (see the code in the next sections) where the internal 
state is stored as 64-bit unsigned integers, whereas n is restricted to be 32bit unsigned integer; so the 
internal constant is N — 2 62 while n £ {2 . . . 2 32 — 1}, When reaching the if at Zme[p[ m is in the range 

2 62 < 

m < 2 64 , and r is uniform of modulus m; but m — n * \m/n\ is less than n, that is, less than 2 32 ; 
so the probability that r > n* q at the if is less than 1/2 30 . 

In particular, this means that each B n in the mathematical stream B contains ~ 10~ 8 bits of entropy, so 
there is no need to recycle them. 

Indeed, in the numerical experiments we found out that the following algorithm wastes ~ 30 input 
bits on a total of ~ 10 9 input bits (!) this is comparable to the entropy of the internal state (and may 
also be due to numerical error in adding up log 2 () values). 

Also, this choice of parameters ensures that the algorithm will never practically loop twice before 
returning. When the condition in the if at line |9| is false, we will count it as a failure. In ~ 10 10 calls 
to the algorithm, we only experienced 3 failures. Tj 



5 Speed, simple vs complex algorithms 



We now consider the algorithm Uniform random simple in 2 on the following page 
Again, when the condition in the if at line [5] is false, we will count it as a failure. 
This algorithm will always call the original RNG to obtain b bits, regardless of the value of n. When 
the algorithm fails, it starts again and again draws b bits. This is inefficient in terms of entropy: for 
small values of n it will produce far less entropy than it consume. But, will it be slower or faster than our 
previous algorithm? It turns out that the answer pretty much depends on the speed of the back-end RNG 
(and this is unsurprising); but also on how much time it takes to compute the basic operations "integer 
multiplications" q * n and "integer division" \N/n\ : we will see that, in some cases, these operations are 
so slow that they defeat the efficiency of the algorithm Uniform random by bit recycling. 



4 But this would render difficult to prove that the output numbers are independent... 

5 For this reason, the else block may be omitted with no big impact on the quality of the output - we implement this 
idea in the algorithm unif orm_random_by_bit_recycling_cheating. 
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l: procedure Uniform random simple(ii) 



2: repeat 

3: r : = GetRandomBits(b); > fill the state with b bits 

4: q = \_N/n\ ; > integer division, rounded down 

5: if r < n * q then 

6: return rmodn > remainder, is random variable of modulus n 

7: end if > otherwise, start all over again 

8: until forever 



9: end procedure 

Figure 2: Uniform random simple ; in our tests N — 2 b or N = 2 b — 1, whereas b — 32, 40, 48, 64 



6 Numerical tests 
6.1 Architectures 

The tests were performed in six different architectures, 
(HW1) Intel® Core ™ 2 Duo CPU El '500 2.93GHz, in i686 mode, 
(HW2) Intel® Core ™ 2 Duo CPU PI 350 2.00GHz, in i686 mode, 
(HW3) AMD Athlon™ 64 X2 Dual Core Processor 4200+, in x86_64 mode, 
(HW4) AMD Athlon ™ 64 X2 Dual Core Processor 4800+, in x86_64 mode, 
(HW5) Intel® Core ™ 2 Duo CPU P7350 2.00GHz, in x86_64 mode, 
(HW6) Intel ® Xeon ® CPU 5160 3.00GHz , in x86_64 mode. 

In the first five cases, the host was running a Debian GNU /Linux or Ubuntu O.S. , and the code was 
compiled using gcc 4-4i with the optimization flags 

-march=native -03 -f inline-functions -f no-strict-aliasing -f omit-f rame-pointer -DKDEBUG . 

In the last case, the O.S. was Gentoo and the code was compiled using gcc 4-0 with flags 

-march=nocona -03 -f inline-functions -f no-strict-aliasing -f omit-f rame-pointer -DKDEBUG . 



6.2 Back-end PRNGs 

To test the speed of the following algorithms, we used four different back-end PRNGs. 

(sfmt_sse) The SIMD oriented Fast Mersenne Twister(SFMT) ver. 1.3.3 by Mutsuo Saito and Makoto 
Matsumoto [2] (compiled with SSE support) 

(xorshift) The xorshift generator by G. Marsaglia [3] 

(sfmt sse md5) as sfmt_sse above, but moreover the output is cryptographically protected using 

the MD5 algorithm 

(bbs260) The Blum-Blum-Shub algorithm [5], with two primes of size ~ 130bit. 

The last two were home-made, as examples of slower but (possibly) cryptographically strong RNG[^] All 
of the above were uniformized to implement two functions, my_gen_rand32() and my_gen_rand64() , 
that return (respectively) a 32bit or a 64bit unsigned integer, uniformly distributed. The C code for all 
the above is in the appendix |B.1| The speeds of the different RNGs are listed in the tables in sec. |A.1| 
|on page 1(J| 

We also prepared a simple counter "RNG" algorithm, that returns numbers that are in arithmetic 
progression; since it is very simple, it is useful to assess the overhead complexity in the testing code itself; 
this overhead is on the order of 2 to to 4 ns, depending on the CPUs. 

6 The author makes no guarantees, though, that the implemented versions are really good and cryptographically strong 
RNGs — we are interested only in their speeds. 
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6.3 Ad hoc functions 



We implemented some ad hoc functions, that are then used by the uniform RNGs (that are described in 
the next section). 

NextBit returns a bit 

Next2Bit returns two bits 

NextByte returns 8 bits 

Next Word returns 16 bits 

For any of the above, we prepared many variants, that internally call either the my_gen_rand32() or 



my_gen_rand64() calls (see the C code in sec. B.2 on page 20 1 and then we benchmarked them in all 
architecture, to choose the faster one (that is then used by the uniform RNGs). [^] 
We also prepared a specific method (that is not used for the uniform RNGs): 

NextCard returns a number uniformly distributed in the range ... 51 (it may be thought of as a card 
randomly drawn from a deck of cards) . 

The detailed timings are in the tables in sec. |A.l on page 10| 
6.4 Uniform RNGs 

We implemented nine different versions of uniform random generators. The C code is in |B.3 on page 2~3] 
we here briefly describe the ideas. Four versions are based on the "simple" generator in fig. |2J 

uniform random simple32 uses 32bit variables internally, N = 2 32 — 1, and consumes a 32bit random 

number, (a call to my_gen_rand32 ) each time 

uniform random simple40 uses 64bit variables internally, N = 2 40 , and calls my_gen_rand32() and NextByteO 

each time 

uniform random simple48 uses 64bit variables internally, N = 2 48 , and calls my_gen_rand32() and NextWordO 

each time 

uniform random simple64 uses 64bit variables internally, N = 2 64 — 1, and calls my_gen_rand64() each 

time. 

Then there are three versions based on the "bit recycling" generator in fig. [I] (all use 64bit variables 
internally) : 

uniform random by bit recycling is the code in fig. [I] (but it refills the state by popping two bits at a 

time) 

uniform random by bit recycling faster it refills the state by popping words, bytes and pairs of bits, 

for improved efficiency 

uniform random by bit recycling cheating as the "faster" one, but the if/else block is not imple- 
mented, and the modulus rmodn is always returned; this is not mathematically exact, but the probability 
that it is inexact is ~ 2 -30 . 

Moreover there are "mixed" methods 

uniform random simple recycler uses 32bit variables internally, keeps an internal state that is sometimes 

initialized but not refilled each time (so, it is useful only for small n), 

uniform random by bit recycling 32 when n < 2 29 , it implements the "bit recycling" code using 32bit 

variables; when 2 29 < n < 2 32 , it implements a "simple"-like method, using only bit shifting. 

We tested them in all of the architectures, for different values of the modulus n, and graphed the 



results (see appendix A. 2.1 on page 11 1 



7 We had to make an exception for when the back-end RNG is based on SFMT, since SFMT cannot mix 64bit and 32bit 
random number generations: in that case, we forcibly used the 32bit versions (that in most of our benchmarks are anyway 
slightly faster). 
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6.5 Timing 

To benchmark the algorithms, we computed the process time using both the Posix call clock () (that 
returns an approximation of processor time used by the program) and the CPU TSC (that counts the 
number of CPU ticks). When benchmarking one of the above back-end RNGs or the ad hoc functions, 
we called it in repeated loops of 2 24 iterations each, repeating them for at least 1 second of processor 
time; and then compared the data provided by TSC and clockO. We also prepared a statistics of the 
values 

ATSC 

cycles per clock := — 

' AclockO 

so that we could convert CPU cycles to nanoseconds; we verified that the standard deviation of the 
logarithm of the above quantity was usually less than 1%. 

To avoid over-optimization of the compiler, the results of any benchmarked function was a;or-ed in a 
bucket variable, that was then printed on screen. 

During benchmarking, we disabled the CPU power-saving features, forcing the CPU to be at max- 
imum performance (using the cpuf req-set command) and also we tied the process to one core (using 
the taskset command). 

Unfortunately the clockO call, in GNU/Linux systems, has a time resolution of O.Olsec, so it was 



too coarse to be used for the graphs in section A. 2.1 for those graphs, only the TSC was used (and then 



cycles were converted to nanoseconds, using the average value of cycles_per_clock). 
6.6 Conclusions 

While efficiency is exactly mathematically assessed, computational speed is a more complex topic, and 
sometimes quite surprising. We report some considerations. 

1. In our Intel™ CPUs running in 32bit mode, integers divisions and remainder computation using 
64bit variables are quite slow: in HW1, each such operations cost ~ 15ns. 

2. Any bit recycling method that we could think of needs at least four arithmetic operations for each 
result it produces; moreover there is some code to refill the internal state. 

3. In the same CPUs, the cost of bit shifting or xor operations are on the order of 3 ns, even on 64bit 
variables; moreover the back-ends sf mt_sse and xorshif t can produce a 32bit random number in 
~ 5 ns. 

4. So, unsurprisingly, when the back end is sfmt_sse and xorshift, and the Intel™ CPU runs in 
32bit mode, the fastest methods are the "simple32" and "simple_recycler" methods, that run in 
~ 10ns; and the bit recycling methods are at least 5 times slower than those. 

5. When the back-end is sfmt_sse and xorshift, but the the Intel™ CPU runs in 64bit mode, the 
fastest method are still the "simple32" and "simple_recycler" methods; the bit recycling methods 
are twice slower. 

6. When the back-ends RNGs are sfmt_sse or xorshift, in the AMD™ CPUs, the "simple32" 
and "simple_recycler" take ~ 40ns; this is related to the fact that 32bit division and remainder 



computation need ~ 20ns (as is shown in sec. A. 2.2 1. So these methods are much slower than the 



back-ends RNGs, that return a 32bit random number in ~ 6ns. 

One consequence is that, since the unif orm_random_by_bit_recycling_32 for n > 2 29 uses a 
"simple"-like method with only bit shifting, then it is much faster than the "simple32" and "sim- 
ple_recycler". 

What we cannot explain is that, in the same architectures, the NextCard32 function, that imple- 
ments the same type of operations, runs in ~ 8ns (!) 

(We also tried to test the above with different optimizations. Using the xorshift back-end, 
setting optimization flags to be just -00, NextCard32 function takes ~ 21ns; setting it to -0, it 
takes ~ 12ns.) 
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7. The back-end RNGs sfmt_sse_md5 or bbs260, are instead much slower, that is, sf mt_sse_md5 
needs ~ 300 ns to produce a 32bit number, and bbs260 needs ~ 500 ns when the CPU runs in 
64bit mode and more than a microsecond (!) in 32bit mode. 

In this case, the bit recycling methods are usually faster. Their speed is dominated by how many 
times the back-end RNGs is called, so it can be estimated in terms of entropy bitrate, and indeed 
the graphs are (almost) linear (since the abscissa is in logarithmic scale). 

8. One of the biggest surprises comes from the NextCard functions: there are four implementations, 

• using 64bit or 32bit variables; 

• computing a result for each call, or precomputing them and storing in an array (the "prefillcd" 
versions) . 

The speed benchmarks give discordant results. When using the faster back-ends sfmt_sse and 
xorshift, the "prefilled" versions are slower. When using the slower back-ends sfmt_sse_md5 or 
bbs260, the 64bit "prefilled" version is the fastest in Intel™ CPUs; but it is instead much slower 
than the "non prefilled" version in AMD™ CPUs. It is possible that the cache misses are playing 
a role in this, but we cannot provide a good explanation. 

9. Curiously, in some Intel™ CPUs, the time needed for an integer arithmetic operation depends 
also on the values of the operands (and not only on the bit sizes of the variable) ! See the graph in 
sec. | A . 2 . 2 1 So, the speed of the functions depend on the value of the modulus n. This is the reason 
why some the graphs are all oscillating in nature. 

In particular, when we looked at the graphs for Core 2 architectures in 32bit mode, by looking at 
the graphs of the functions simple_40, simple_48 (where N — 2 40 ,2 48 constant) we noted that 
the operations q :— N/n,qn := n* q are ~ 10 ns slower when n < N2~ 32 than when n > N2~ 32 . 
This is similar to what is seen in the graphs in sec. |A.2.2| 

Instead the speed graphs in AMD™ CPUs are almost linear, and this is well explained by the 
average number of needed operations. 

Summarizing, the speeds are quite difficult to predict; if a uniform random generator is to be used for 
n in a certain range, and the back-end RNG takes approximatively as much time as 4 integer operations 
in 64bits, then the only sure way to decide which algorithm is the fastest one is by benchmarking. If 
a a uniform random generator is to be used for a constant and specific n (such as in the case of the 
NextCard function), there may be different strategies to implement it, and again the only sure way to 
decide which algorithm is the fastest one is by benchmarking. 



A Test results 

A.l Speed of back-end RNGs and Ad hoc functions 

These tables list the average time (in nanoseconds) of the back-end RNGs and the ad hoc functions (see 



the C code in sec. B.2 on page 20); these same data are plotted as red crosses in the plots of the next 
section. For each family, the fastest function is marked blue; competitors that differ less than 10% are 
italic and blue; competitors that are slower more than 50% are red. 
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A. 2 Graphs 

In all of the following graphs, the abscissa is n, (that is the modulus of the uniform RNGs); the abscissa 
is in log-scale (precisely, it contains all n from 2 to 32, and then n is incremented by [^/32J up to 2 32 , 
for a total of 733 samples). The number in parentheses near the graph labels are the average time for 
call (in nanoseconds; averaged in the aforementioned log scale). 



A. 2.1 Uniform random generators 

To reduce the size of the labels, we abbreviated unif orm_random_by_bit_recycling as bbr, and 
unif orm_random_simple as simple. 
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A. 2. 2 Integer arithmetic 



typedef uint64_t st ; 
st div32(uint32_t n) { 

return my_gen_rand32 C ) / n ; 

} 

st div32_24(uint32_t n) { 

return (my_gen_rand32 () & OxFFOOOOOO) / n ; 

} 

st div48(uint32_t n) { 

uint64_t r = my_gen_rand32C) , m = r << 16; 
return m / n ; 

} 

st div64(uint32_t n) { 

uint64_t m = my_gen_rand64C) / n ; 
return m; 

} 

st mod32(uint32_t n) { 

return my_gen_rand32 C ) 7 P n ; 

} 

st mod32_24(uint32_t n) { 

return (my_gen_rand32 () k OxFFOOOOOO) 7. n ; 

} 

st mod48(uint32_t n) { 

uint64_t r = my_gen_rand32C) , m = r << 16; 
return m % n ; 

} 

st mod64(ulnt32_t n) { 

uint64_t m =my_gen_rand64 C ) 7 n ; 
return m; 

} 

st prod32(uint32_t n) { 

return my_gen_rand32 C ) * n ; 

} 

st prod32_24(uint32_t n) { 

return (my_gen_rand32 () k OxFF) * n ; 

} 

st prod48(uint32_t n) { 

uint64_t r = my_gen_rand32C) , m = r << 16; 
return m * n ; 

} 

st prod64(uint32_t n) { 

uint64_t m = my_gen_rand64C) * n ; 
return m; 

} 
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div64 (15) 
mod48 (13) 

div48 (13) 



+ 



mod 32 (8) 
mod32_24 (8) 
div32„24 (8) 
div32 (8) 
prod64 (6) 



prod32 (4) 
prod48 (4) 
prod32_24 (4) 



1G 



B Code 



The complete C code is available on request. The code that we wrote is licensed according to the Gnu 
Public License v2.0. (The SFMT code is licensed according to a modified BSD license — they are 
considered to be compatible). 

In the following code, the macro COUNTBITS was used in testing efficiency; and was disabled while 
testing speeds. 



B.l Back-end RNGs 

// 

//to compile this code, define RNG, by using 'gcc .... -DRNG=n', where n is 

// 1 -> SFMT , by M. Saito and M. Matsumoto 

// 2 -> SFMT + md5 

// 3 -> xorshift , by Marsaglia 

// 4 -> Blum Blum Shub with ~128bit (product of two -31bit primes) (only on amd64, using gcc 128 int types) 
// 5 -> Blum Blum Shub with ~260bit modulus (product of two ~130bit primes) 

//disclaimer: methods 2,4,5 are not guaranteed to generate high quality pseudonumbers ; they were used 
// only to test the code speed 



/*************** SFMT ***/ 

#if 1==RNG 

#ifdef HAVE_SSE2 

char *RNGNAME= " SFMT (sse)"; 

char *RNGNICK="sfmt_sse" ; 

#else 

char *RNGNAME="SFMT" ; 
char *RNGNICK="sf mt" ; 
#endif 

#include "SFMT.h" 

uint32_t my_gen_rand32() 
{ 

uint32_t r=gen_rand32() ; 
COUNTBITS (32) ; 
return r; 

} 

uint64_t my_gen_rand64() 
{ 

uint64_t r=gen_rand64() ; 
COUNTBITS (64) ; 
return r; 

} 

void my_init_gen_rand(uint32_t seed) 
{ 

init_gen_rand(seed) ; 

} 

/*************** SFMT + md5 ***/ 
#elif 2==RNG 
#ifdef HAVE_SSE2 

char *RNGNAME= " SFMT (sse) + md5"; 
char *RNGNICK="sfmt_sse_md5"; 
#else 

char *RNGNAME="SFMT + md5" ; 
char *RNGNICK="sfmt_md5"; 
#endif 

#include "SFMT.h" 
#include "md5.h" 



uint32_t my_gen_rand32 () 
{ 

uint32_t I [8] ; 

for(int i=0;i<8;i++) I [i] =gen_rand32() ; 
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unsigned char *UCp, output [16] ; 
UCp= (unsigned char*) I; 
md5 (UCp , 32 , output ) ; 

uint32_t *U32p=(uint32_t *) output, r=*U32p; 
C0UNTBITS(32) ; 
return r; 



uint64_t my _gen_rand64 ( ) 
{ 

uint32_t I [8] ; 

for(int i=0;i<8;i++) I [i] =gen_rand32() ; 
unsigned char *UCp, output [16] ; 
UCp= (unsigned char*) I; 
md5 (UCp , 32 , output ) ; 

uint64_t *U64p=(uint64_t *)output, r=*U64p; 
C0UNTBITS(64) ; 
return r; 



void my_init_gen_rand(uint32_t seed) 
{ 

if (md5_self _test (0) ) exit (4) ; 
init_gen_rand(seed) ; 

} 

/*************** xorshift , by Marsaglia***/ 
#elif 3==RNG 

char *RNGNAME="xorshif t" ; 
char *RNGNICK="xorshift"; 



static uint32_t 
static uint32_t 
static uint32_t 
static uint32_t 



_xorshif t 
_xorshif t 
_xorshif t 
_xorshif t 



— y 



123456789 
362436069 
521288629 
88675123; 



uint32_t my_gen_rand32 (void) { 
uint32_t t; 

t = xorshift x ~ ( xorshift x << 11); 

xorshift x = xorshift y; xorshift y = xorshift z; xorshift z = xorshift w; 

C0UNTBITS(32) ; 

return xorshift w = xorshift w ~ ( xorshift w >> 19) (t ~ (t >> 8) ) ; 

} 

uint64_t my _gen_rand64 ( ) 
{ 

uint64_t a=my_gen_rand32() , r = (a « 32) I my_gen_rand32 ( ) ; 
return r; 

} 

void my_init_gen_rand(uint32_t seed) 
{ 

__xorshift__x = 123456789 ~ seed; 
__xorshift__y = 362436069; 

xorshift z = 521288629; 

__xorshift__w = 88675123; 

} 

/******************** Blum Blum Shub (needs amd64 arch) *******************/ 
#elif 4==RNG 

char *RNGNAME="Blum Blum Shub (64bit)"; 
char *RNGNICK="bbs64" ; 



uint64_t my_seed=0x987fed5; 

typedef __uintl28_t uintl28_t; 

uint32_t my_gen_rand32 (void) { 

const uint64_t p = 4222234259UL; //- 2*31.9 
const uint64_t q = 4222231271UL; 
const uint64_t M = p * q ; 

const uint64_t mask = OxFFFFFFFFFFFFFFFFUL ; 
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uintl28_t a = my_seed, b = a * a , my_seed = b '/, M; 
my_seed = c & mask ; 
C0UNTBITS(32) ; 

return my_seed & OxFFFFFFFFF; 



uint64_t my _gen_rand64 ( ) 
{ 

uint64_t a=my_gen_rand32() , r = (a « 32) I my_gen_rand32 ( ) ; 
return r; 

} 

void my_init_gen_rand(uint32_t seed) 
{ 

my_seed=seed " 0x987fed5 ; 

} 

/***********************************************************/ 
/******************** Blum Blum Shub *******************/ 
#elif RNG==5 

char *RNGNAME="Blum Blum Shub (260bit)"; 
char *RNGNICK="bbs260" ; 
#include "gmp.h" 

mpz_t bbs pq , bbs n , bbs 64 , bbs ; 

int initialized = 0; 

void my_init_gen_rand(uint32_t seed) 
{ 

if (! initialized) { 

//http : / /primes . utm . edu/lists/ small/ small . html 

mpz_t p; mpz_init_set_str (p, "3615415881585117908550243505309785526231", 10); 
assert (mpz_probab_prime_p(p, 12)) ; 

mpz_t q; mpz_init_set_str (q, "5570373270183181665098052481109678989411", 10); 
assert (mpz_probab_prime_p(q, 12)) ; 

mpz_mul(bbs pq ,p , q) ; 

mpz_clear (p) ; mpz_clear(q) ; 

mpz_init(bbs n ) ; 

initialized=l ; 

} 

unsigned long s=seed+0xl00000000 ; 
mpz_set_ui(bbs n , s) ; 



void bbs_step() 
{ 

mpz_t sqr; mpz_init (sqr) ; 

mpz_mul(sqr, bbs n , bbs n 

mpz_tdiv_r (bbs n , sqr, bbs 

mpz_clear (sqr) ; 

} 

uint32_t my_gen_rand32() 
{ 

bbs_step() ; 

unsigned long int r=mpz_get_ui (bbs n ); 

C0UNTBITS(32) ; 

if( sizeof (unsigned long int) == 4) { 

return r; 
} else { 

uint32_t rr=r & OxFFFFFFFF; 

return rr; 

} 



uint64_t my_gen_rand64 
{ 

bbs_step() ; 

unsigned long int r=mpz_get_ui (bbs n ); 

C0UNTBITS(64) ; 

if ( sizeof (unsigned long int) == 8 ) { 



__); 

— pq— ) ; 
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return r; 
} else { 

mpz_t q; mpz_init (q) ; mpz_tdiv_q_2exp (q, bbs n , 32); 

uint64_t r2=mpz_get_ui (q) ; 
mpz_clear (q) ; 

uint64_t rr=r I (r2 « 32) ; 
return rr; 

} 

} 

/*************** a counter , to test speeds ***/ 

#elif 11==RNG 

char *RNGNAME=" counter" ; 

char *RNGNICK=" counter" ; 

static uint32_t my_seed32=0 ; 
static uint64_t my_seed64=0 ; 

uint32_t my_gen_rand32 (void) { 
C0UNTBITS(32) ; 
return my_seed32 += 0x4cl; 

} 

uint64_t my _gen_rand64 ( ) 
{ 

COUNTBITS (64) ; 

return my_seed64 += 0x4c7000004cl ; 

} 

void my_init_gen_rand(uint32_t seed) 
{ 

my_seed32=seed; 
my_seed64=seed; 

} 

#else 

terror "please define RNG" 

#endif 

// 

B.2 ad hoc functions 

// 

unsigned int NextByte32() { 
static int 1=0; 
static uint32_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and32 ( ) ; 
1=4; 

} 

unsigned int byte=R&255; 

l—; 

if(l) R»=8; 
return byte; 

} 

unsigned int NextByte64() { 
static int 1=0; 
static uint64_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and64 ( ) ; 
1=8; 

} 

unsigned int byte=R & 255; 

l—; 

if(l) R»=8; 
return byte; 

} 

unsigned int saved_bytes [8] ; 

unsigned int NextByte64_pref illedO { 
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static int 1=0; 

if (unlikely (1<=0)) { 

uint64_t R=my_gen_rand64() 

saved_bytes[0] = R & 255 

R »= 8 

_bytes[l] = R & 255 



bytes [2] = R & 255 

bytes [3] = R & 255 

bytes [4] = R & 255 

bytes [5] = R & 255 

bytes [6] = R & 255 



saved_ 

R »= 

saved_ 

R »= 8; 

saved_ 

R »= 8; 

saved_ 

R »= 8; 

saved_ 

R »= 8; 

saved_ 

R »= 8; 

__saved_bytes[7] = R k 255; 
1=8; 

} 

l—; 

return saved_bytes [1] ; 

} 

unsigned int NextWord32() { 
static int 1=0; 
static uint32_t R=0; 
if(l<=0) { 

R=my _gen_r and32 ( ) ; 
1=2; 

} 

unsigned int bytes=R & OxFFFF; 
R»=16; 

l—; 

return bytes; 

} 

unsigned int NextWord64() { 
static int 1=0; 
static uint64_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and64 ( ) ; 
1=4; 

} 

unsigned int bytes=R & OxFFFF; 

R»=16; 

1— i 

return bytes; 

} 

unsigned int Next2Bit32() { 
static int 1=0; 
static uint32_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and32 ( ) ; 
1=16; 

} 

unsigned int bit=R&3; 
R»=2; 

l—; 

return bit; 

} 

unsigned int Next2Bit64() { 
static int 1=0; 
static uint64_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and64 ( ) ; 
1=32; 

} 

unsigned int bit=R&3; 
R»=2; 

l—; 

return bit; 



const static uint32_t bitmasks [32] = { 

0x1, 0x2, 0x4, 0x8 , 0x10, 0x20, 0x40, 0x80, 

0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 

0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000, 0x800000, 

0x1000000, 0x2000000, 0x4000000, 0x8000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000 

}; 

unsigned int NextBit32_by_mask() { 
static int 1=0; 
static uint32_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and32 ( ) ; 
1=32; 

} 

l—; 

return (R & bitmasks [1] ) ? 1 : 0; 

} 

unsigned int NextBit32() { 
static int 1=0; 
static uint32_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and32 ( ) ; 
1=32; 

} 

unsigned int bit=R&l; 
R»=l; 

l—; 

return bit; 

} 

unsigned int NextBit64() { 
static int 1=0; 
static uint64_t R=0; 
if (unlikely (1<=0)) { 
R=my _gen_r and64 ( ) ; 
1=64; 

} 

unsigned int bit=R&l; 
R»=l; 

l—; 

return bit ; 

} 

//draws one card from a deck of 52 cards, using 32bit variables 
unsigned int NextCard32() { 
static int 1=0; 
static uint32_t R=0; 
if (unlikely (1<=0)) { 
R=my_gen_rand32 () ; 
1=5; 

} 

i-; 

unsigned int c = R '/, 52; 
R /= 52; 
return c; 

} 

//draws one card from a deck of 52 cards, using 64 bit variables 

unsigned int NextCard64() 

{ 

static int 1=0; 
static uint64_t R=0; 
if (unlikely (1<=0)) { 

R=my_gen_rand64() ; 

1=11; 

} 

i— ; 

unsigned int c = R '/, 52; 
R /= 52; 
return c; 

} 

//draws one card from a deck of 52 cards, using 32 bit variables 
//and prefilling an array in memory 
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unsigned char 32saved_cards [5] ; 

unsigned int NextCard32_pref illedO { 
static int 1=0; 
if (unlikely (1<=0)) { 

uint32_t R=my_gen_rand32 ( ) ; //52**5 < 2**32 

32saved_cards [0] = R '/, 52; 

R /= 52; 

__32saved_cards[l] = R 'I, 52; 
R /= 52; 

__32saved_cards [2] = R % 52; 
R /= 52; 

__32saved_cards [3] = R % 52; 
R /= 52; 

__32saved_cards [4] = R % 52; 
1=5; 

} 

i~; 

return 32saved_cards [1] ; 

} 

//draws one card from a deck of 52 cards, using 64 bit variables 
//and prefilling an array in memory 

unsigned char 64saved_cards [11] ; 

unsigned int NextCard64_pref illedO { 

static int 1=0; 

if (unlikely (1<=0)) { 

uint64_t R=my_gen_rand64() ; //52**11 < 2**64 





_64saved_ 


.cards [0] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [1] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [2] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [3] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [4] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [5] 




R 5 


. 52 


R 


/= 52; 












_64saved_ 


.cards [6] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [7] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [8] 




R / 


. 52 


R 


/= 52; 












_64saved_ 


.cards [9] 




R / 


. 52 


R 


/= 52; 











__64saved_cards[10] = R '/, 52; 
1=11; 

} 

i—; 

return 64saved_cards [1] ; 

} // 

B.3 Uniform random generators 

In the following code, the macro COUNTFAILURES was used in testing efficiency; and was disabled while 
testing speeds. 

// 

/ ********** unif orm_random_by_bit_recycling 

this implements the pseudocode in page 5 

(but pops 2 bits at a time) 
*******************/ 

uint32_t unif orm_random_by_bit_recycling(uint32_t n) 
{ 

static uint64_t m = 1, r = 0; 

const uint64_t N62=( (uint64_t) 1) «62; 

while (1) { 

while(m<N62) { //fill the state 
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r = (r«2) I Next2Bit() ; 
m = m«2; 

} 

const uint64_t q=m/n, nq = n * q; 
if( likely (r < nq) ) { 

uint32_t d = r '/, n; //remainder, is a random variable of modulus n 

r = r/n; //quotient, is random variable of modulus q 

m = q; 

return d ; 
} else { 

COUNTFAILURES O ; 

m = m - nq; 

r = r - nq; // r is still a random variable of modulus m 

} 

} 

} 

/ ********** unif orm_random_by_bit_recycling 

this implements the pseudocode in page 5 , 

but pops words, bytes, and pairs of bits 
*******************/ 

uint32_t unif orm_random_by_bit_recycling_f aster (uint32_t n) 
{ 

static uint64_t m = 1, r = 0; 

const uint64_t N62=( (uint64_t) 1) «62 , N56=((uint64_t) 1)«56, N48=( (uint64_t) 1) «48 ; 
while (1) { 

//fill the state 
if(m<N48) { 

r = (r«16) I NextWordO; 
m = m«16; 

} 

while (m<N56) { 

r = (r«8) I NextByteO; 
m = m<<8; 

} 

while (m<N62) { 

r = (r«2) I Next2Bit() ; 
m = m«2; 

} 

const uint64_t q=m/n, nq=n*q; 
if( likely (r < nq) ) { 

uint32_t d = r '/, n; //remainder, is a random variable of modulus n 

r = r/n; //quotient, is random variable of modulus q 

m = q; 

return d ; 
} else { 

COUNTFAILURES () ; 

m = m - nq; 

r = r - nq; // r is still a random variable of modulus m 



} 

/ ********** unif orm_random_by_bit_recycling_cheating 
this implements the pseudocode in page 6 , 
but does not implement the "else" block, so it is not 
mathematically perfect; at the same time, since 
the probability of the "else" block would be less than l/2~24 
the random numbers generated by this function are good enough 
for most purposes 

*******************/ 

uint32_t unif orm_random_by_bit_recycling_cheating(uint32_t n) 
{ 

static uint64_t m = 1, r = 0; 

const uint64_t N48=( (uint64_t) 1) «48, N56=((uint64_t) 1)«56; 
if(m<N48) { //fill the state 

r = (r«16) I NextWordO ; 

m = m«16; 

} 

while (m<N56) { 

r = (r«8) I NextByteO; 
m = m<<8; 
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} 

uint32_t d = r '/. n; 
r = r/n; 
m = m/n; 
return d; 

} 

/ ********** unif orm_random_by_bit_recycling32 
this implements the pseudocode in page5 
but only using 32bit variables, so it has some special 
methods when n>=2~29 

uint32_t unif orm_random_by_bit_recycling32(uint32_t n) 
{ 

const uint32_t N29=( (uint32_t) 1) «29, N30=( (uint32_t) 1) «30 , N31=( (uint32_t) 1) «31 , N24=( (uint32_t) 1)«24; 
//special methods 
if(n>=N31) { 
while (1) { 

uint32_t r = my_gen_rand32() ; 
if( likely (r < n) ) { 

return r; 
} else { 

COUNTFAILURESO ; 

} 

}} 

if(n>=N30) { 
while (1) { 

uint32_t r = my_gen_rand32() » 1; 
if( likely (r < n) ) { 

return r; 
} else { 

COUNTFAILURESO ; 

} 

}} 

if(n>=N29) { 
while(l) { 

uint32_t r = my_gen_rand32() » 2; 
if( likely (r < n) ) { 

return r; 
} else { 

COUNTFAILURESO ; 

} 

}} 

//usual bit recycling 

static uint32_t m = 1, r = 0; 

while (1) { 

while(m<N24) { //fill the state 

r = (r«8) I NextByteO ; 

m = m<<8; 

} 

while(m<N30) { //fill the state 
r = (r«2) I Next2Bit() ; 
m = m«2; 

} 

uint32_t q=m/n, nq=q*n; 
if (likely ( r < nq) ) { 

uint32_t d = r '/, n; //remainder, is a random variable of modulus n 

r = r/n; //quotient, is random variable of modulus q 

m = q; 

return d; 
} else { 

COUNTFAILURESO ; 

m = m - nq; 

r = r - nq; // r is still a random variable of modulus m 

} 

} 

} 

/ ********** unif orm_random_simple_64 

this is a simple implementation, found in many random number libraries 
this version uses 64 bit variables 
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uint32_t unif orm_random_simple_64(uint32_t n) 
{ 

const uint64_t N=OxFFFFFFFFFFFFFFFFU; //2"64-l; 
uint64_t q = N/n, nq=( (uint64_t)n) * q; 
while (1) { 

uint64_t r = my _gen_r and64 ( ) ; 
if( likely (r < nq) ) { 

uint32_t d = r '/, n; //remainder, is a random variable of modulus n 
return d ; 
} else { 

COUNTFAILURES ; 

} 

} 

} 

unif orm_random_simple 
this is a simple implementation, found in many random number libraries 
this version uses 32 bit variables 

*******************/ 

uint32_t unif orm_random_simple_32 (uint32_t n) 

{ 

const uint32_t N=OxFFFFFFFFU; // 2*32-1; 
uint32_t q = N/n; 
while (1) { 

uint32_t r = my_gen_rand32 () ; 
if( likely (r < n * q) ) { 

uint32_t d = r '/, n; //remainder, is a random variable of modulus n 
return d ; 
} else { 

COUNTFAILURES () ; 

} 

} 

} 

/ ********** unif orm_random_simple_recycler 

this is a simple implementation, with recycling for small n 

this version uses 32 bit variables 
*******************/ 

uint32_t unif orm_random_simple_recycler (uint32_t n) 
{ 

static uint32_t _r=0, _m=0; 

const uint32_t N=OxFFFFFFFFU; //2~32-l 

if (_m>n) { 

uint32_t d = _r "/, n; 

_m/ =n ; 

_r/=n; 

return d; 

} 

const uint32_t q = N/n, nq=n*q; 
while (1) { 

uint32_t newr = my_gen_rand32 () ; 

if (newr < nq) { 

uint32_t d = newr '/, n; //newr is a random variable of modulus n 
if (_m<q) { //there is more entropy in newr than in _r 
_m=q; 

_r=newr/n; 

} 

return d ; 
} else { 

COUNTFAILURES () ; 

} 

} 

} 

/********** unif orm_random_simple 

some alternative versions, using 64 bit variables 
*******************/ 

uint32_t unif orm_random_simple_40 (uint32_t n) 
{ 

const uint64_t N=( (uint64_t) 1)«40 ; 
uint64_t q = N/n, nq=( (uint64_t)n) * q; 
while (1) { 
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uint64_t r = my_gen_rand32 () ; 

r=(r<<8) I NextByteO; //create 40 bits random numbers 
if( likely (r < nq) ) { 

uint32_t d = r % n; //remainder, is a random variable of modulus n 

return d; 
} else { 

C0UNTFAILURESO ; 

} 

} 

} 

uint32_t unif orm_random_simple_48 (uint32_t n) 
{ 

const uint64_t N=( (uint64_t) 1) «48 ; 
uint64_t q = N/n, nq=( (uint64_t)n) * q; 
while (1) { 

uint64_t r = my_gen_rand32 () ; 

r=(r<<16) I NextWordO ; //create 48 bits random numbers 
if( likely (r < nq) ) { 

uint32_t d = r % n; //remainder, is a random variable of modulus n 

return d; 
} else { 

C0UNTFAILURESO ; 

} 

} 

} 
// 
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